Note: Interpreting BLAST search results
Let's consider how one might go about assigning a numerical value to the
degree of similarity between two DNA sequences. Suppose we have two sequences
as follows:
CGGCAT
CGCGAT
Let's assign 1 point for each position where the bases match exactly and 0 points
for each position where they do not. We have C-C (match), G-G (match), G-C
(no match), C-G (no match), A-A (match), and T-T (match) for a total of
4 points. Under this hypothetical system, the more nucleotides that match
up, the higher the score.
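The point-counting scheme just described can be sketched in a few lines of Python. This is only an illustration of the hypothetical 1/0 scoring used in this note, not how BLAST itself scores; the function name `match_score` is made up for the example.

```python
def match_score(seq_a: str, seq_b: str) -> int:
    """Count positions where two equal-length sequences carry the same base.

    Toy scoring from the text: 1 point per match, 0 points per mismatch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)


print(match_score("CGGCAT", "CGCGAT"))  # 4
```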
When comparing two DNA sequences, it's important to remember that because
of evolutionary history, the sequences may have diverged not only by substitution
of bases but also possibly by deletions or insertions of bases. This means
that the sequences being matched may not be exactly the same length
and the alignment may need gaps. For these two sequences, one best-scoring
alignment is
CGGC-AT
CG-CGAT
for a total of 5 points.
Another possible alignment is
CG-GCAT
CGCG-AT
for a total of 5 points.
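Both gapped alignments can be checked with a short Python sketch. It assumes the toy scoring of this note (1 point per match, 0 points for a mismatch or a gap); real alignment programs normally penalize gaps. The function names are hypothetical. Under this toy scoring, the best achievable score equals the length of the longest common subsequence, which the second function finds by dynamic programming.

```python
def aligned_score(row_a: str, row_b: str, gap: str = "-") -> int:
    """Score two already-aligned rows: 1 per matching base, 0 for mismatch or gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)


def best_score(seq_a: str, seq_b: str) -> int:
    """Best score over all gapped alignments under the toy scoring.

    With match = 1 and mismatch = gap = 0, this is the length of the
    longest common subsequence, computed row by row.
    """
    prev = [0] * (len(seq_b) + 1)
    for a in seq_a:
        curr = [0]
        for j, b in enumerate(seq_b, start=1):
            diagonal = prev[j - 1] + (1 if a == b else 0)  # align a with b
            curr.append(max(diagonal, prev[j], curr[j - 1]))  # or open a gap
        prev = curr
    return prev[-1]


print(aligned_score("CGGC-AT", "CG-CGAT"))  # 5
print(aligned_score("CG-GCAT", "CGCG-AT"))  # 5
print(best_score("CGGCAT", "CGCGAT"))       # 5
```

Because gaps cost nothing here, several different alignments tie for the best score, which is why the two alignments shown both reach 5 points.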
From the simple example above, you can imagine how rapidly sequence comparisons
can become complicated as DNA length increases. The statistics for comparing
two sequences of DNA are thus highly complicated. Here we cover just the
bare essence of the topic so that you can interpret the response from your
sequence query.
Let's suppose you do a BLAST search of the following sequence:
TATCGCGTATTGCC
BLAST will come back with a result, starting with the reference of the
search program, the number of letters in your sequence, the number of
letters in the database, a graphic representation of the sequence matches,
and a list of matches. The list of matches is sorted with the best matching
sequences shown first. For the sequence we used, the list starts with
the following:
                                                      Score     E
Sequences producing significant alignments:           (bits)    Value
gb|AC012156.14|AC012 Homo sapiens chr 12..              28      5.8
ref|NC_001142.1 Saccharomyces cerevisiae...             28      5.8
What does this mean? "Score" is a numerical score assigned by BLAST. In
the simple example we used earlier, we assigned 1 point for each match and
0 points for each non-match. In BLAST, the scoring system uses "bits" as the
measure of information. For DNA, each position can be occupied by
T, A, C, or G. In this simplified view, each match therefore carries 2 bits
of information (only 1 of the 4 possibilities is correct). For a
14-nucleotide-long sequence like ours, the maximum match score is then
28 bits. The higher the score, the better the match.
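The 28-bit figure follows from simple arithmetic, sketched below in Python. This reflects only the simplified model of this note (log2 of the alphabet size per matching position); the function name `max_bit_score` is hypothetical, and the bit scores BLAST actually reports are derived from raw alignment scores using statistical parameters.

```python
import math


def max_bit_score(query: str, alphabet_size: int = 4) -> float:
    """Upper bound on the score under the simplified model in the text:
    each matching position carries log2(alphabet_size) bits."""
    bits_per_match = math.log2(alphabet_size)  # 2 bits for T, A, C, G
    return bits_per_match * len(query)


print(max_bit_score("TATCGCGTATTGCC"))  # 28.0
```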
"E-value" is the number of hits
one can expect to see just by chance when searching a database of a particular
size. The value is defined as
$E = (N/n) \cdot m \cdot n \cdot 2^{-S}$
where m and n are the lengths of the two nucleotide sequences (measured
in base pairs), S is the bit score, and N refers to the total length of
all sequences in the database. The formula should make intuitive sense.
For example, if S is higher (i.e., better matches), you would expect to
see fewer "hits." On the other hand, if m or n is larger (i.e., one or
the other sequence is longer), then you would expect to see more hits
purely by chance. Finally, if the database contains more sequences (i.e.,
N is larger), then you would expect to see more hits. In any case, if
BLAST returns an E-value that is very small or close to zero, then you
probably have a meaningful match that is not due to random chance.
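The behavior of the E-value formula can be explored with a small Python sketch. The function name `expected_hits` and the sample values for n and N are assumptions chosen purely for illustration; they are not taken from the search shown above.

```python
def expected_hits(m: int, n: int, db_length: float, bit_score: float) -> float:
    """E-value as defined in the text: E = (N / n) * m * n * 2^(-S).

    m, n      -- lengths of the two sequences being compared
    db_length -- N, total length of all sequences in the database
    bit_score -- S, the score in bits
    """
    return (db_length / n) * m * n * 2.0 ** (-bit_score)


# Hypothetical inputs: a 14-base query, a 1,000-base database sequence,
# and a database of one billion bases in total.
base = expected_hits(m=14, n=1000, db_length=1e9, bit_score=28)

# One extra bit of score halves the number of chance hits expected.
print(expected_hits(14, 1000, 1e9, 29) / base)  # 0.5

# A database ten times larger yields ten times as many chance hits.
print(expected_hits(14, 1000, 1e10, 28) / base)  # 10.0
```

The exponential dependence on S is what makes high-scoring alignments so unlikely to arise by chance, while the dependence on sequence and database length is only linear.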
To interpret the matches, you therefore need to pay attention to whether the E-value is reasonably small. E-value is related to the P-value by the following formula:
$P = 1 - e^{-E}$
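Here P is the probability of finding at least one hit with a score this good
purely by chance, so a smaller value is better. An E-value of 3 corresponds to
a P-value of about 0.95, meaning such a hit is almost certain to turn up by
chance. For the conventional significance level of P = 0.05, the E-value is
about 0.05; for small values, E and P are nearly equal. Thus, in your search,
an E-value of about 0.05 or less would indicate a statistically significant match.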
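The conversion between E and P can be evaluated numerically. The Python sketch below implements the relation between the two values in both directions; the function names are hypothetical.

```python
import math


def p_from_e(e_value: float) -> float:
    """P = 1 - e^(-E): probability of at least one chance hit."""
    return -math.expm1(-e_value)  # expm1 keeps precision when E is tiny


def e_from_p(p_value: float) -> float:
    """Inverse relation: E = -ln(1 - P)."""
    return -math.log1p(-p_value)


print(round(p_from_e(3.0), 3))   # 0.95
print(round(e_from_p(0.05), 3))  # 0.051
print(round(p_from_e(5.8), 3))   # 0.997
```

An E-value of 5.8, as reported for the two hits listed earlier, gives a P-value close to 1, which is what one would expect for a query only 14 bases long.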
You should also keep in mind that there are a lot of sequences in the
database and that some of them are from the same species and therefore
might be very similar. In some cases, the name of the organism may have
changed after it was originally reported; accordingly, two or more sequences
may match extremely well but appear to belong to completely different
species.
For references and descriptions of the statistics used in BLAST, consult
the BLAST tutorial page (Internet connection required).